NPS-OR-01-011 


NAVAL  POSTGRADUATE  SCHOOL 
Monterey,  California 


Stochastic  Models  for 
Promoting  and  Testing  System 
Reliability  Evolution 

by 

Donald  P.  Gaver 
Patricia  A.  Jacobs 
Ernest  A.  Seglie 

September  2001 


Approved  for  public  release;  distribution  is  unlimited. 

Prepared  for  Naval  Postgraduate  School 
Monterey,  California  93943 




NAVAL  POSTGRADUATE  SCHOOL 
MONTEREY,  CA  93943-5000 


RADM  David  R.  Ellison  Richard  Elster 

Superintendent  Provost 


This report was prepared for and funded by the Director, Operational Test and Evaluation
(DOT&E), The Pentagon (Room 3E318), Washington, DC 20301-1700. Research also
supported by the Institute of Joint Warfare Analysis (IJWA) and the Modeling, Virtual
Environments and Simulation (The MOVES) Institute at the Naval Postgraduate School.

Reproduction  of  all  or  part  of  this  report  is  authorized. 


This  report  was  prepared  by: 


DONALD  P.  GAVER 
Distinguished  Professor  of 
Operations  Research 


ERNEST A. SEGLIE
Director, Operational Test and Evaluation


PATRICIA A. JACOBS
Professor  of  Operations  Research 


Associate  Chairman  for  Research 
Department  of  Operations  Research 


Department  of  Operations  Research 


Associate  Provost  and  Dean  of  Research 


REPORT  DOCUMENTATION  PAGE 


Form  approved 
OMB No. 0704-0188


Public reporting burden for this collection of information is estimated to average 1 hour per response, including the time for reviewing instructions, searching existing data sources,
gathering and maintaining the data needed, and completing and reviewing the collection of information. Send comments regarding this burden estimate or any other aspect of this collection
of information, including suggestions for reducing this burden, to Washington Headquarters Services, Directorate for Information Operations and Reports, 1215 Jefferson Davis Highway,
Suite 1204, Arlington, VA 22202-4302, and to the Office of Management and Budget, Paperwork Reduction Project (0704-0188), Washington, DC 20503.


1.  AGENCY  USE  ONLY  (Leave  blank)  2.  REPORT  DATE  3.  REPORT  TYPE  AND  DATES  COVERED 

September  2001  Technical  Report 


4.  TITLE  AND  SUBTITLE 

Stochastic  Models  for  Promoting  and  Testing  System  Reliability  Evolution 


6.  AUTHOR(S) 

Donald P. Gaver, Patricia A. Jacobs, Ernest A. Seglie


5.  FUNDING 


MDPR  NO.  DVM^IOOOl 


7.  PERFORMING  ORGANIZATION  NAME(S)  AND  ADDRESS(ES) 
Naval  Postgraduate  School 
Monterey,  CA  93943 


8.  PERFORMING  ORGANIZATION 
REPORT  NUMBER 
NPS-OR-01-011 


9.  SPONSORING/MONITORING  AGENCY  NAME(S)  AND  ADDRESS(ES) 
Office  of  the  Director,  Operational  Test  and  Evaluation  (DOT&E) 

The Pentagon (Room 3E318)

Washington,  DC  20301-1700 


10.  SPONSORING/MONITORING 
AGENCY  REPORT  NUMBER 


12a.  DISTRIBUTION/AVAILABILITY  STATEMENT 


12b.  DISTRIBUTION  CODE 


13.  ABSTRACT  (Maximum  200  words.) 

Many  systems  and  systems-of-systems  function  in  sequential-stage  fashion,  and  are  constantly  on  when  operative,  but  are 
failure-susceptible.  Communication  systems,  power  generation  and  transmission,  and  vehicular  transportation  systems  tend 
to  fall  into  this  category.  We  propose  a  reliability  growth  model  for  such  systems  that  is  based  on  design  defect  removal 
under a Test-Fix-Test (TFT) protocol: a system is assembled and put under test, for example for a fixed mission time, or
multiple  thereof.  If  the  system  fails  during  the  test  time  its  failure  source  in  some  stage  is  diagnosed,  the  stage  is  re-designed, 
and  the  new  prototype  system  reassembled  (system  design  is  “fixed”)  and  the  system  is  re-tested.  The  test  (TFT)  process  is 
repeated  until  a  pre-determined  test  period  elapses  with  no  failures.  This  is  analogous  to  the  run-test  criteria  analyzed  for 
one-shot  devices  [1].  In  this  model  we  also  allow  for  occasional  defective  re-design:  response  to  a  test  failure  can  actually 
(and  realistically)  increase  the  number  of  failure-generating  design  defects. 

Our  model  allows  quick  numerical  assessment  of  TFT  operating  characteristics,  given  defining  parameter  values.  It  thus 
provides  a  planning  tool  for  test  designers. 


14.  SUBJECT  TERMS 

Operational  test  and  evaluation;  reliability  growth;  software  testing;  system  with  substages 


15.  NUMBER  OF 
PAGES 

15 


16.  PRICE  CODE 


17.  SECURITY  CLASSIFICATION 
OF  REPORT 

Unclassified 


NSN  7540-01-280-5800 


18.  SECURITY  CLASSIFICATION 
OF  THIS  PAGE 

Unclassified 


19.  SECURITY  CLASSIFICATION  20.  LIMITATION  OF 
OF  ABSTRACT  ABSTRACT 

Unclassified  UL 


Standard  Form  298  (Rev.  2-89) 
Prescribed  by  ANSI  Std  239-18 


Stochastic  Models  for  Promoting  and  Testing  System 
Reliability  Evolution 


Donald  P.  Gaver 
Operations  Research  Department 
Naval Postgraduate School
Monterey,  CA  93943 
Email:  dgaver@nps.navy.mil 


Patricia  A.  Jacobs 
Operations  Research  Department 
Naval  Postgraduate  School 
Monterey,  CA  93943 
Email:  pajacobs@nps.navy.mil 


Ernest  A.  Seglie 

Director,  Operational  Test  and  Evaluation 
The  Pentagon 
Washington,  DC  20301 
eseglie@dote.osd.mil 


ABSTRACT 

Many systems and systems-of-systems function in sequential-stage
fashion, and are constantly on when operative, but are
failure-susceptible. Communication systems, power generation and
transmission, and vehicular transportation systems tend to fall into this
category. We propose a reliability growth model for such systems that
is based on design defect removal under a Test-Fix-Test (TFT)
protocol: a system is assembled and put under test, for example for a
fixed mission time, or multiple thereof. If the system fails during the
test time its failure source in some stage is diagnosed, the stage is
re-designed, and the new prototype system reassembled (system design is
“fixed”) and the system is re-tested. The test (TFT) process is repeated
until a pre-determined test period elapses with no failures. This is
analogous to the run-test criteria analyzed for one-shot devices [1]. In
this model we also allow for occasional defective re-design: response
to a test failure can actually (and realistically) increase the
number of failure-generating design defects.

Our  model  allows  quick  numerical  understanding  of  TFT  operating 
characteristics,  given  defining  parameter  values.  It  thus  provides  a 
planning  tool  for  test  designers. 

1. Introduction and Model Formulation

Mathematical models are formulated for the reliability evolution (desirably
growth [2], [3], [4], [5], but also occasional realistic decay) of a continuously
operating (“always on”) system that is tested, fixed (partially re-designed) if it fails,
re-tested, etc., until a specified stopping condition is achieved. The stopping rule
utilized here is analogous to a run test [1], [6]; here the entire system must survive
without any failure for a time $\tau$ in order to pass the test, have its design frozen, and be
eligible for operational testing and eventual usage in the field.

Two  test  measures  of  effectiveness  (MOEs)  are  analytically  evaluated: 

(a) the probability that the system survives in the field, i.e., after the
end-to-end testing period of specified duration $\tau$ is survived without failure,
and the design is frozen; and

(b)  the  expected  duration  of  such  a  test. 

It is also possible to analytically evaluate other such measures by our backward
equation technique: the variance of test duration, the probability distribution of
remaining design defects or faults, and so forth. All of these measures are evaluated
in terms of basic parameters, such as the initial number of design-fault-susceptible
modules per stage ($d_i$ for stage $i$), the maximum number per stage ($m_i$), the rate of
design fault activation, hence failure, per design fault module ($\lambda_i$), the number of
sequential stages ($S$), the duration of the fault-free test interval that must be survived
in order to pass the test ($\tau$) (specified in advance by the planner/analyst), the
probability of effective re-design/fault removal ($p_i$), and the probability of ineffective
re-design/fault addition ($a_i$). In the present model study the analyst must furnish
values for these basic “what if” parameters, and the model then evaluates the MOEs
(a) and (b). The model is also extended to account for test-to-test environmental
variability, both random and systematic.

Our  model  can  also  form  the  basis  for  statistical  inference  concerning  stage-wise 
fault  population  parameters.  Given  observations  on  failures  at  various  stages,  a 
likelihood  function  can  be  written  down  and  analyzed,  possibly  making  use  of 
Bayesian  methodology.  This  will  be  a  topic  for  future  research. 

2.  Generic  Situation:  Staged  Systems  in  Continuous  Time  Under 
Test-Fix-Test 

Consider a system, $\mathcal{S}$, that is made up of $S$ stages,
$\mathcal{S}_1, \mathcal{S}_2, \ldots, \mathcal{S}_i, \ldots, \mathcal{S}_S$, the $i$th
($i = 1, \ldots, S$) of which has a maximum number of modules, $m_i$, all of which must
operate for the stage $\mathcal{S}_i$ to be operative. However, stage $i$ initially has
$d_i$ ($1 \le d_i \le m_i$) design defects, i.e., improperly designed failure-prone modules.
These are presumed to activate independently and randomly as exposure (test, or field
operation) time elapses. Initially we presume the time to activation/failure of each
design defect in stage $i$ to be exponentially distributed, with rate $\lambda_i$. The
$m_i - d_i$ modules without defects at $\mathcal{S}_i$ are assumed (for now) not to be
failure-susceptible.
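
The independence and exponential assumptions above imply that the time to first failure of the whole system is exponential with the summed rate, and that the failing stage is selected in proportion to its share of that rate. The following is a minimal sketch of that calculation; the function name and the sample values are illustrative assumptions, not part of the report.

```python
import math

def subtest_survival(lam, d, tau):
    """Return (survival probability, per-stage failure shares).

    lam[i] : activation rate of one design defect in stage i
    d[i]   : current number of design defects in stage i
    tau    : exposure (test) time
    Names and sample values are illustrative assumptions.
    """
    # Minimum of independent exponentials is exponential with summed rate.
    total_rate = sum(l * n for l, n in zip(lam, d))
    survive = math.exp(-total_rate * tau)
    if total_rate == 0.0:
        shares = [0.0] * len(lam)  # no defects, so no failure source
    else:
        # Given a failure, stage i is the source w.p. lam_i * d_i / total.
        shares = [l * n / total_rate for l, n in zip(lam, d)]
    return survive, shares

survive, shares = subtest_survival(lam=[0.01, 0.05], d=[2, 2], tau=50.0)
print(survive, shares)
```

The same two quantities, the survival probability and the stage shares, are the building blocks of the backward equations developed in the later sections.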

It is here assumed that if the system $\mathcal{S}$ is put on test at $t = 0$ it operates
successfully until the first design-defective module in any stage activates/fails; when that
module fails, $\mathcal{S}$ fails (no redundancy). Occurrence of such activation is an
opportunity for re-design (permanent or temporary repair) of the failed module. If this step
is (i) positively effective the module is no longer activation/failure-prone, i.e., $d_i$ is
decreased by one; if this step is (ii) negatively effective, the re-design is not only
ineffective, it adds a defective module, so the net number of defects is increased by
one; otherwise the re-design is (iii) ineffective, meaning that there is no change in the
number of defective modules. Note that the above defect-removal/re-design option is
only available at the testing (developmental, early operational) stage.

2.1  Test  Protocol 

We analyze properties of a no-mission-time-failure test protocol: specify a test
time, $\tau$, and test the system for that time. Each such test event is called a subtest. If a
failure occurs during that subtest, perform re-design and test again, continuing until
the system survives for time $\tau$ without failure. At this moment the test is complete
and the design is frozen. This is clearly analogous to the run-of-r criteria
analyzed in [1].

There  are  two  simple  versions  of  this  protocol. 

(A) The subtests all last for the basic test time $\tau$, even if a failure occurs during
a subtest and the subtest has failed at that point. For the present we
consider just one failure to be possible during a subtest. Generalizations
will be furnished later.

(B) The subtests each last until the time to first failure or time $\tau$, whichever
occurs first.

This  requires  that  the  system  be  constantly  monitored  in  real  time  to  discover 
failure  occurrence;  if  this  is  feasible  it  is  undoubtedly  more  time  efficient.  But 
operational  circumstances  may  compel  the  use  of  (A).  It  is  the  version  we 
analyze  first. 
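
To make the difference between the two protocols concrete, the sketch below simulates one Test-Fix-Test campaign and accumulates the elapsed time both ways from the same sequence of failures. It is a Monte Carlo illustration rather than the report's analytical method. The function name is hypothetical, and the removal probability 0.75 is an assumed value (no value is legible in this copy of the report); the re-design outcome boundaries follow the rule that a defect can be added only while the stage is below its maximum.

```python
import random

def simulate_tft(lam, d0, m, p, a, tau, rng):
    """Simulate one Test-Fix-Test campaign (illustrative sketch).

    Returns (time_A, time_B, final_defects): time_A charges tau for every
    subtest (Protocol A); time_B charges min(failure time, tau)
    (Protocol B). At most one failure is handled per subtest.
    """
    d = list(d0)
    time_a = time_b = 0.0
    while True:
        rates = [l * n for l, n in zip(lam, d)]
        total = sum(rates)
        # Time to first activation; infinite when no defects remain.
        t_fail = rng.expovariate(total) if total > 0 else float("inf")
        time_a += tau
        time_b += min(t_fail, tau)
        if t_fail >= tau:
            return time_a, time_b, d      # subtest survived: design frozen
        # Failing stage is chosen with probability rate_i / total.
        i = rng.choices(range(len(d)), weights=rates)[0]
        u = rng.random()
        if u < p[i]:
            d[i] -= 1                     # effective re-design
        elif u < p[i] + a[i] and d[i] < m[i]:
            d[i] += 1                     # re-design adds a defect
        # otherwise: ineffective re-design, no change

rng = random.Random(1)
runs = [simulate_tft([0.01, 0.05], [2, 2], [4, 4], [0.75, 0.75],
                     [0.20, 0.10], 50.0, rng) for _ in range(2000)]
print(sum(r[0] for r in runs) / len(runs),   # mean duration, Protocol (A)
      sum(r[1] for r in runs) / len(runs))   # mean duration, Protocol (B)
```

Because both durations are accumulated along the same sample path, the Protocol (B) time can never exceed the Protocol (A) time in any single run.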

3. Test Protocol Modeling and Measures of Test Effectiveness Under
Protocol (A)

The system model and test protocol yield expressions that allow numerical
evaluation of measures of system test-fix, etc., effectiveness.

3.1  Probability  of  Fielded  (Design-Frozen)  Success 

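Let $p_\tau(d_1, \ldots, d_i, \ldots, d_S)$ = Probability that the tested and accepted
(design-frozen) system survives without failure for time $\tau_F$.

Then by probability arguments that proceed from the first subtest (backward
equation approach) we obtain

$$
\begin{aligned}
p_\tau(d_1,\ldots,d_i,\ldots,d_S) ={}&
\underbrace{\exp\Bigl(-\sum_{i=1}^{S}\lambda_i d_i \tau\Bigr)}_{\text{probability no failures, so no re-designs}}\;
\underbrace{\exp\Bigl(-\sum_{i=1}^{S}\lambda_{iF}\, d_i \tau_F\Bigr)}_{\text{probability no field failures}} \\
&+ \Bigl(1 - \exp\Bigl(-\sum_{i=1}^{S}\lambda_i d_i \tau\Bigr)\Bigr)
\sum_{i=1}^{S}\frac{\lambda_i d_i}{\sum_{k=1}^{S}\lambda_k d_k}
\Bigl[\underbrace{p_i(d_i)\,p_\tau(d_1,\ldots,d_i-1,\ldots,d_S)}_{\text{defect in module } i \text{ removed}} \\
&\quad + \underbrace{a_i(d_i)\,p_\tau(d_1,\ldots,d_i+1,\ldots,d_S)}_{\text{new defect introduced by ``redesign''}} \\
&\quad + \underbrace{\bigl(1 - p_i(d_i) - a_i(d_i)\bigr)\,p_\tau(d_1,\ldots,d_i,\ldots,d_S)}_{\text{no change in number of defects}}\Bigr]
\end{aligned}
\tag{3.1}
$$

where $\lambda_{iF}$ is the failure rate in the field of a remaining design defect in stage $i$.

The conditional probabilities of defect removal ($p_i$) and addition ($a_i$) are
assumed to be

$$
p_i(d_i) = \begin{cases} p_i & \text{for } 1 \le d_i \le m_i \\ 0 & \text{otherwise} \end{cases}
\tag{3.2,a}
$$

$$
a_i(d_i) = \begin{cases} a_i & \text{for } 1 \le d_i \le m_i - 1 \\ 0 & \text{otherwise} \end{cases}
\tag{3.2,b}
$$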

In the present model $d_i \le m_i$ for all stages, where $m_i$ is the specified maximum
number of defects in stage $i$. The above expression may be recursively solved,
starting with $p_\tau(0, 0, \ldots, 0) = 1$; (3.2,b) prevents the number of defects from
exceeding $m_i$ in stage $i$.
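
One way to evaluate (3.1) numerically is to treat it as a finite system of linear equations, one per defect vector, and solve it directly. The sketch below does this with a small Gaussian elimination routine so that it needs only the standard library. The function names are hypothetical, and the removal probability 0.75 is an assumption: no value is legible in this copy of the report, and 0.75 is used here because it appears consistent with the (0, 1) entries of Table 1.

```python
import itertools
import math

def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        s = sum(M[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x

def field_success(lam, lam_f, m, p, a, tau, tau_f):
    """Return {defect vector: p_tau} for every state, per (3.1)-(3.2)."""
    states = list(itertools.product(*[range(mi + 1) for mi in m]))
    index = {s: k for k, s in enumerate(states)}
    n = len(states)
    A = [[0.0] * n for _ in range(n)]
    b = [0.0] * n
    for s in states:
        r = index[s]
        A[r][r] += 1.0
        total = sum(l * di for l, di in zip(lam, s))
        q = math.exp(-total * tau)                  # no failure in subtest
        field = math.exp(-sum(l * di for l, di in zip(lam_f, s)) * tau_f)
        b[r] = q * field
        for i, di in enumerate(s):
            if di == 0:
                continue
            w = (1.0 - q) * lam[i] * di / total     # failure from stage i
            ai = a[i] if di <= m[i] - 1 else 0.0    # boundary rule (3.2,b)
            down = s[:i] + (di - 1,) + s[i + 1:]
            A[r][index[down]] -= w * p[i]
            if ai > 0.0:
                up = s[:i] + (di + 1,) + s[i + 1:]
                A[r][index[up]] -= w * ai
            A[r][r] -= w * (1.0 - p[i] - ai)
    x = solve_linear(A, b)
    return {s: x[index[s]] for s in states}

res = field_success(lam=[0.01, 0.05], lam_f=[0.05, 0.05], m=[4, 4],
                    p=[0.75, 0.75], a=[0.20, 0.10], tau=50.0, tau_f=100.0)
print(round(res[(0, 1)], 3))
```

A direct solve is used here because the equation for a state involves both the state with one fewer defect and the state with one more, so a single downward pass from $(0, \ldots, 0)$ is not enough on its own.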


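3.2 Expected Test Duration, Protocol (A) (Each Subtest Requires Time $\tau$)

Let $w_\tau(d_1, d_2, \ldots, d_i, \ldots, d_S)$ = Expected/mean time to complete a test that
terminates with system first failure-free survival of time $\tau$.

Then again by arguing from the first subtest

$$
\begin{aligned}
w_\tau(d_1,d_2,\ldots,d_i,\ldots,d_S) ={}& \tau
+ \Bigl(1 - \exp\Bigl(-\sum_{i=1}^{S}\lambda_i d_i \tau\Bigr)\Bigr)
\sum_{i=1}^{S}\frac{\lambda_i d_i}{\sum_{k=1}^{S}\lambda_k d_k}
\Bigl[p_i(d_i)\,w_\tau(d_1,\ldots,d_i-1,\ldots,d_S) \\
&\quad + a_i(d_i)\,w_\tau(d_1,\ldots,d_i+1,\ldots,d_S) \\
&\quad + \bigl(1 - p_i(d_i) - a_i(d_i)\bigr)\,w_\tau(d_1,\ldots,d_i,\ldots,d_S)\Bigr]
\end{aligned}
\tag{3.3}
$$

Here the initial/boundary condition is $w_\tau(0, \ldots, 0, \ldots, 0) = \tau$.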
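
As a quick check on (3.3), consider the simplest special case: a single stage ($S = 1$) with $m_1 = 1$ and one initial defect, so that $a_1(1) = 0$ by (3.2,b). Writing $q = e^{-\lambda_1 \tau}$ (a shorthand introduced only for this illustration), the backward equation and the boundary condition give a closed form:

```latex
\begin{aligned}
w_\tau(1) &= \tau + (1-q)\bigl[p_1\, w_\tau(0) + (1-p_1)\, w_\tau(1)\bigr],
  \qquad w_\tau(0) = \tau, \\
w_\tau(1) &= \tau\,\frac{1 + (1-q)\,p_1}{1 - (1-q)(1-p_1)} .
\end{aligned}
```

The fraction is the expected number of subtests. For example, with $\lambda_1 = 0.05$, $\tau = 50$, and an assumed $p_1 = 0.75$, $q = e^{-2.5} \approx 0.082$ and $w_\tau(1) \approx 110$ hours, i.e. a little over two subtests on average.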

3.3  Generalization  for  Between-Test  Variability 

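It is possible to explicitly account for an additional likely source of variability:
subtest environmental variation, represented by a sequence of positive independent
identically distributed random variables, $\theta_1, \theta_2, \ldots, \theta_t, \ldots$,
where $t$ denotes the subtest number. Illustrate by generalizing (3.1): condition on the
values $\theta_1, \theta_2, \ldots$, then decondition subtest by subtest.

$$
\begin{aligned}
p_\tau(d_1,d_2,\ldots,d_S) ={}& E\bigl[p_\tau(d_1,\ldots,d_S;\theta_1,\ldots,\theta_t,\ldots)\bigr] \\
={}& E\Bigl[\exp\Bigl(-\theta\sum_{i=1}^{S}\lambda_i d_i \tau\Bigr)\Bigr]
\exp\Bigl(-\sum_{i=1}^{S}\lambda_{iF}\, d_i \tau_F\Bigr) \\
&+ \Bigl(1 - E\Bigl[\exp\Bigl(-\theta\sum_{i=1}^{S}\lambda_i d_i \tau\Bigr)\Bigr]\Bigr)
\sum_{i=1}^{S}\frac{\lambda_i d_i}{\sum_{k=1}^{S}\lambda_k d_k}
\Bigl[p_i(d_i)\,p_\tau(d_1,\ldots,d_i-1,\ldots,d_S) \\
&\quad + a_i(d_i)\,p_\tau(d_1,\ldots,d_i+1,\ldots,d_S) \\
&\quad + \bigl(1 - p_i(d_i) - a_i(d_i)\bigr)\,p_\tau(d_1,\ldots,d_i,\ldots,d_S)\Bigr]
\end{aligned}
\tag{4.4}
$$
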

4. Test Protocol Modeling Under Protocol (B) (Each Subtest Requires
the Time to Failure or $\tau$, Whichever Occurs First)

In  this  protocol  it  is  possible  to  correctly  detect  a  failure  in  Stage  i  when  it 
occurs,  without  waiting  until  the  end  of  the  test. 

4.1  Probability  of  Fielded  (Design  Frozen)  Success 

If $p_\tau(d_1, d_2, \ldots, d_i, \ldots, d_S)$ is defined as in Section 3.1, then the
backward equation for this function is the same as in (3.1). Furthermore, the expression
(4.4) that incorporates independent between-test variability holds for this situation also.

4.2  Expected  Test  Duration,  Protocol  (B) 

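Define $w_\tau(d_1, d_2, \ldots, d_i, \ldots, d_S)$ to be the mean time to test termination
(after the system survives time $\tau$). Then in this situation the backward equation becomes

$$
\begin{aligned}
w_\tau(d_1,d_2,\ldots,d_i,\ldots,d_S) ={}& \tau \exp\Bigl(-\sum_{k=1}^{S}\lambda_k d_k \tau\Bigr)
+ \sum_{i=1}^{S}\int_0^{\tau} \exp\Bigl(-\sum_{k=1}^{S}\lambda_k d_k x\Bigr)\lambda_i d_i\,
\Bigl[x + p_i(d_i)\,w_\tau(d_1,\ldots,d_i-1,\ldots,d_S) \\
&\quad + a_i(d_i)\,w_\tau(d_1,\ldots,d_i+1,\ldots,d_S) \\
&\quad + \bigl(1 - p_i(d_i) - a_i(d_i)\bigr)\,w_\tau(d_1,\ldots,d_i,\ldots,d_S)\Bigr]\,dx
\end{aligned}
$$

This simplifies to

$$
\begin{aligned}
w_\tau(d_1,d_2,\ldots,d_i,\ldots,d_S) ={}&
\frac{1 - \exp\bigl(-\sum_{i=1}^{S}\lambda_i d_i \tau\bigr)}{\sum_{i=1}^{S}\lambda_i d_i}
+ \Bigl(1 - \exp\Bigl(-\sum_{i=1}^{S}\lambda_i d_i \tau\Bigr)\Bigr)
\sum_{i=1}^{S}\frac{\lambda_i d_i}{\sum_{k=1}^{S}\lambda_k d_k}
\Bigl[p_i(d_i)\,w_\tau(d_1,\ldots,d_i-1,\ldots,d_S) \\
&\quad + a_i(d_i)\,w_\tau(d_1,\ldots,d_i+1,\ldots,d_S) \\
&\quad + \bigl(1 - p_i(d_i) - a_i(d_i)\bigr)\,w_\tau(d_1,\ldots,d_i,\ldots,d_S)\Bigr].
\end{aligned}
\tag{5.1}
$$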
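
The simplification rests on two elementary integrals. Writing $\Lambda = \sum_{k} \lambda_k d_k$ (a shorthand introduced here, not the report's notation) and assuming $\Lambda > 0$, integration by parts gives:

```latex
\begin{aligned}
\tau e^{-\Lambda\tau} + \int_0^{\tau} x\,\Lambda e^{-\Lambda x}\,dx
  &= \tau e^{-\Lambda\tau} + \Bigl[-x e^{-\Lambda x}\Bigr]_0^{\tau}
     + \int_0^{\tau} e^{-\Lambda x}\,dx
   = \frac{1 - e^{-\Lambda\tau}}{\Lambda}, \\
\int_0^{\tau} \lambda_i d_i\, e^{-\Lambda x}\,dx
  &= \frac{\lambda_i d_i}{\Lambda}\bigl(1 - e^{-\Lambda\tau}\bigr).
\end{aligned}
```

The first line is the expected length of a single subtest under Protocol (B), that is, the mean of the smaller of the failure time and $\tau$. The second is the probability that the subtest ends with a failure originating in stage $i$.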

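To generalize (5.1) to account for between-test variability it is only necessary to
replace the first term,

$$
\frac{1 - \exp\bigl(-\sum_{i=1}^{S}\lambda_i d_i \tau\bigr)}{\sum_{i=1}^{S}\lambda_i d_i},
\quad \text{by} \quad
E\left[\int_0^{\tau}\exp\Bigl(-\theta\sum_{i=1}^{S}\lambda_i d_i x\Bigr)dx\right],
$$

and, in the second term, $1 - \exp\bigl(-\sum_{i=1}^{S}\lambda_i d_i \tau\bigr)$ by
$E\bigl[1 - \exp\bigl(-\theta\sum_{i=1}^{S}\lambda_i d_i \tau\bigr)\bigr]$, where the
expectation is on $\theta$.
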


Any distribution having positive support and with an explicit Laplace-Stieltjes
transform provides tractable closed-form models for Protocol (A). To obtain a
closed-form expression for Protocol (B) it must be possible to integrate the Laplace
transform of $\theta$ from zero to a finite limit, $\sum_{i=1}^{S}\lambda_i d_i \tau$.
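
As one illustration (the gamma distribution is an assumption made here for concreteness, not a choice stated in the report), suppose $\theta$ is gamma distributed with shape $\alpha$ and rate $\alpha$, so that $E[\theta] = 1$. Its Laplace transform is $\phi(s) = E[e^{-s\theta}] = (1 + s/\alpha)^{-\alpha}$, and with the shorthand $\Lambda = \sum_i \lambda_i d_i$ the two expectations needed become:

```latex
\begin{aligned}
E\bigl[e^{-\theta\Lambda\tau}\bigr]
  &= \Bigl(1 + \frac{\Lambda\tau}{\alpha}\Bigr)^{-\alpha}, \\
E\Bigl[\int_0^{\tau} e^{-\theta\Lambda x}\,dx\Bigr]
  &= \frac{1}{\Lambda}\int_0^{\Lambda\tau}\phi(s)\,ds
   = \frac{1}{\Lambda}\,\frac{\alpha}{\alpha-1}
     \Bigl[1 - \Bigl(1 + \frac{\Lambda\tau}{\alpha}\Bigr)^{-(\alpha-1)}\Bigr],
     \qquad \alpha \neq 1 .
\end{aligned}
```

For $\alpha = 1$ (exponential $\theta$) the second expression becomes $\Lambda^{-1}\ln(1 + \Lambda\tau)$. As $\alpha \to \infty$ the environment becomes constant and both expressions reduce to $e^{-\Lambda\tau}$ and $(1 - e^{-\Lambda\tau})/\Lambda$ respectively.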

5.  Illustrative  Numerical  Example 

The  backward  equations  may  be  solved  iteratively  to  provide  numerical  insights 
into  system  performance  under  the  TFT  testing  protocol.  Here  is  a  brief,  isolated,  but 
suggestive  example. 
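
A minimal sketch of such an iterative solution for the expected test duration is given below. It sweeps repeatedly over all defect vectors, updating each value in place (a Gauss-Seidel style scheme chosen here for simplicity; the report does not say which iteration its software uses). The function name is hypothetical, and the removal probability 0.75 is an assumed value that appears consistent with the (0, 1) entries of Table 1.

```python
import itertools
import math

def expected_test_time(lam, m, p, a, tau, protocol="A",
                       tol=1e-9, max_sweeps=100000):
    """Iterate (3.3) for protocol 'A' or (5.1) for protocol 'B'.

    Returns {defect vector: expected test duration}. States are updated
    in place until the largest change in a sweep falls below tol.
    """
    states = list(itertools.product(*[range(mi + 1) for mi in m]))
    w = {s: tau for s in states}            # start from the boundary value
    for _ in range(max_sweeps):
        change = 0.0
        for s in states:
            total = sum(l * di for l, di in zip(lam, s))
            if total == 0.0:
                continue                    # w(0,...,0) = tau
            fail = 1.0 - math.exp(-total * tau)
            # Mean subtest length: tau under (A), E[min(T, tau)] under (B).
            first = tau if protocol == "A" else fail / total
            rest = 0.0                      # terms involving other states
            stay = 0.0                      # coefficient of w(s) itself
            for i, di in enumerate(s):
                if di == 0:
                    continue
                share = fail * lam[i] * di / total
                ai = a[i] if di <= m[i] - 1 else 0.0
                rest += share * p[i] * w[s[:i] + (di - 1,) + s[i + 1:]]
                if ai > 0.0:
                    rest += share * ai * w[s[:i] + (di + 1,) + s[i + 1:]]
                stay += share * (1.0 - p[i] - ai)
            new = (first + rest) / (1.0 - stay)
            change = max(change, abs(new - w[s]))
            w[s] = new
        if change < tol:
            break
    return w

args = ([0.01, 0.05], [4, 4], [0.75, 0.75], [0.20, 0.10], 50.0)
w_a = expected_test_time(*args, protocol="A")
w_b = expected_test_time(*args, protocol="B")
print(round(w_a[(0, 1)]), round(w_b[(0, 1)]))
```

The only difference between the two protocols in this sketch is the mean length charged for one subtest, which is why Protocol (B) can never take longer on average.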

5.1 A Two-Stage Situation


The parameters used are the following:

$\lambda_1 = 0.01$, $\lambda_2 = 0.05$ (Test defect activation rate, hours$^{-1}$)

$p_1 = p_2 =$ (Defect rectification/correction probability)

$a_1 = 0.20$, $a_2 = 0.10$ (Defect mis-identification/addition of one defect probability)

$m_1 = m_2 = 4$ (Maximum number of defects in each stage)

$\tau_F = 100$ (The field mission time, hours)

$\lambda_{1F} = 0.05$, $\lambda_{2F} = 0.05$ (Field defect activation rate, hours$^{-1}$)

(A) Examine the effect of the basic sub-test time, $\tau$, on the probability of
surviving a field operation without failure. The numbers in the small table below
indicate the surprisingly systematic effect of test duration on probability of successful
field operation.

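Table 1: Probability of Surviving $\tau_F$ (Field Operation)

Each series lists entries for test times $\tau$ = 50, 100, 200, 300, in that order.
A series with fewer than four values is incomplete in the source, and the test time
belonging to each surviving value cannot be determined there.

Initial defects $(d_1, d_2) = (0, 1)$
    Probability:  0.89, 0.99, 1.00, 1.00
    ( ):          119, 252, 508, 761
    [ ]:          70, 127, 229, 329

Initial defects $(d_1, d_2) = (2, 2)$
    Probability:  0.25, 0.52
    ( ):          276, 651, 1462, 2264
    [ ]:          106, 375

Initial defects $(d_1, d_2) = (2, 4)$
    Probability:  0.55
    ( ):          426, 945, 3119
    [ ]:          513

Initial defects $(d_1, d_2) = (4, 4)$
    Probability:  0.81
    ( ):          519, 2625
    [ ]:          139, 261, 451, 593

( ) = Expected Test Time, Protocol (A)

[ ] = Expected Test Time, Protocol (B)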

However, the required number of tests tends to increase substantially, particularly
under Protocol (A). If the test can be stopped as soon as a failure occurs, considerable
time can be saved. The moral is that only by considerable testing and fixing (in an
error-prone “fix” environment) can we eventually hope to have a highly reliable
(small, two-stage) system.

Software that can be activated to exercise programs to evaluate various situations
(and parameter variations) appears at http://www.nps.navy.mil/opnsrsch/testeval/.